FOREWORD


This is one in a continuing series of papers concerned with the theory and
application of admissible probability measurement techniques and one of a
sub-series of papers concerned with the effects of guessing on the
interpretation and use of objective test results. This paper constitutes the
Second Semiannual Technical Report of work performed in support of the
United States Air Force Office of Scientific Research contract number
AF 49(638)-1744 sponsored by The Advanced Research Projects Agency of the
Department of Defense (ARPA order number 833).


ABSTRACT 

Logic and mathematics are employed to yield very conservative estimates
of the gains resulting from changing over from choice methods to admissible
probability measurement in the administration of existing tests.

Equations and graphs give test reliability and measurement validity as a
function of the distribution of ability levels in the population to be
tested and as a function of the amount and type of guessing engaged in by this
population. Since guessing degrades the performance of choice tests and
since the use of admissible probability measurement eliminates guessing, the
extent of degradation corresponds to a conservative estimate of the gain
resulting from conversion to admissible probability measurement.

In some applications it may be wise to trade off the increase in
measurement validity against the advantages of shortening the length of the test.
Equations and graphs show how much shorter the new guessing-free test can be
and still retain the original measurement validity.

Additional equations and curves show that choice tests with zero
measurement validities can have appreciable reliabilities due to differences in
guessing strategy in the population.

All the analyses indicate that conversion to admissible probability
measurement will yield quite significant improvements in measurement validity
along with considerable reductions in test length.


INTRODUCTION 


The recent development of the theory of admissible probability measurement
(Shuford, Albert & Massengill, 1965) and its successful application in
laboratory settings (cf. Toda, 1963; Shuford, 1965) promises to have a profound
impact on the theory and practice of objective and semi-objective testing.
This new capability to measure an individual's degree of confidence in the
correctness of his answers to a test item means that a great deal more
information can be obtained about the individual's state of knowledge. The
additional information yielded by this new method of testing may be utilized
in many different ways. Some of these ways, though promising great benefits
and improvements over existing procedures, imply the development of new test
items and structures and in other cases creation of new instructional
strategies and materials (Shuford, 1965; Shuford & Massengill, 1966). There are,
however, other more immediately applicable ways of using this new capability
to improve the performance of existing testing programs. Since admissible
probability measurement can be used with any true-false, multiple-choice or
fill-in-the-blank test and since, as in the case of conventional choice
testing, the item scores can be summed to obtain a total test score, it is
entirely feasible to change from the conventional choice method over to
admissible probability measurement in the administration of an existing test.
Such a changeover does not require the writing of any new test items nor
does it require the development of new ways of analyzing and utilizing test
scores. Can such a simple and minimal change result in any benefits to a
testing operation?

Section B of the First Semi-Annual Technical Report (Shuford & Massengill,
1966) considered just this question and arrived at an affirmative answer.

The report took a very conservative approach to estimating the benefits to
be derived from substituting admissible probability measurement for choice
testing. The approach was conservative in two senses. First, it was
assumed that data analysis and personnel decisions would be taken only on the
basis of total test score. The use of total test score dominates current
personnel measurement and testing practice because of the unreliability of
the individual item scores. Adding together these scores from the
unreliable items, in effect, builds up the sample size and thus increases the
reliability of the total test score. With admissible probability measurement,
however, each item score is in a sense completely reliable so that there is
no longer the same compulsion to sum item scores to build test reliability.
One is tempted rather to look at the pattern of individual item scores to
arrive at personnel decisions. But, as stated above, this temptation has been
temporarily resisted. The potential information inherent in the pattern of
item scores will be sacrificed to deal only with total test scores as is the
current practice. This is one respect in which the analysis reported in
Section B of the First Semi-Annual Technical Report is conservative in
estimating the benefits from changing over to admissible probability
measurement.

The other respect in which the analysis is conservative is that it is
assumed that persons taking the test either know the answer to an item with some
assurance or are relatively uncertain about the answer to an item and that
admissible probability measurement will be used only to discriminate
whether a person knows or is guessing at the answer to an item. The item
score obtained with admissible probability measurement typically ranges over
the continuum from minus one up to plus one and thus can be used to make
fine discriminations in the person's state of knowledge concerning the item
in question. This potentially useful information is, however, sacrificed by
mapping all the possible scores into two categories: one indicating that
the person knows the answer to the item; the other indicating that the
person does not know the answer to the item. This is the second respect in
which the item analysis reported in Section B of the First Semi-Annual
Technical Report is conservative in estimating the benefits of changing from
choice testing to admissible probability measurement.

Though these two restrictions assumed by the analysis underestimate the
benefits to be obtained, they do result in a very important advantage. One can
use logic and mathematics in a very straightforward manner for a quantitative
study of the effects of guessing on the quality of personnel and counseling
decisions. It can be determined just how much guessing degrades the
performance of any test by comparing the performance of the
guessing-contaminated test with the performance of the test freed of the effect of
guessing. The amount of degradation can then be used as an estimate of the gains
that will result from eliminating guessing from the test and changing it into a
guessing-free test. Since changing over to admissible probability
measurement will eliminate the effects of guessing from the test (Massengill &
Shuford, 1967), the amount of degradation in the performance of the test due
to guessing now becomes the conservative estimate of the gains resulting
from substituting admissible probability measurement for choice testing.

Section B of the First Semi-Annual Technical Report used probability theory
and decision theory to estimate the effect of guessing on the quality of
personnel and counseling decisions in most of the major applications of
testing. In attempting to keep the mathematics as realistic as possible,
however, a price was paid in that laborious numerical computations were
required to obtain any quantitative result. Therefore, the study was
restricted to a ten-item test used with a population of just one distribution
of ability levels. In this respect the generality of these results is quite
restricted. The results are important, however, in that they serve the same
purpose as existence proofs in showing something is possible. Specifically,
they show that guessing at the levels commonly encountered in practice
seriously degrades the quality of selection, classification and placement
decisions based on total test score. Further, they show that guessing can so
seriously degrade the performance of a test used for educational and
vocational counseling purposes that it is best to abandon testing for this
purpose and just act as though every person had the same average ability level.
Additionally, and less surprisingly, the results show that moderate levels of
guessing can seriously degrade the reliability and validity of a test. And
finally, it is shown that a person's test-wiseness, i.e., whether or not he
guesses on a test, largely determines his chances of being successful on the
test.


In summary, Section B of the First Semi-Annual Technical Report outlines the
methodology for obtaining conservative estimates of the gains from
admissible probability measurement and shows that these gains could be of great
magnitude in a variety of areas of application. The numerical results were
limited, however, to a ten-item test used with one population having a
certain distribution of ability levels. In some cases it would be useful to
extend these results and to increase their generality so that they may be
used more effectively to guide decisions on whether or not to change from
choice testing to admissible probability measurement. This Second
Semi-Annual Technical Report accomplishes just this for the area of test
reliability and validity.


RATIONALE

It has been shown both theoretically (Shuford & Massengill, 1966) and
empirically (Shuford, 1965) that changing from choice testing to admissible
probability measurement can increase the reliability and validity of a test.

This increased reliability and validity can be important in two respects.
First, the higher the validity of the test the better it is in the sense
that it yields better information which in turn implies that better
decisions can be made on the basis of the test information. There exist some
situations, however, where it is better to accept a shorter but less valid
test. The longer the test the greater the "price" that must be paid for the
test information. Since the reliability and validity of most tests can be
increased by adding on additional items, it is apparent that a decision has
been made at least implicitly that the additional validity gained by
lengthening the test is not worth the cost of a longer test. Now, when we think
of using admissible probability measurement with an existing test we should
think also of the possibility of trading off the increased test reliability
and validity against the possibility of having a much shorter test with its
reliability and validity equal to or greater than that of the original test.
The shorter test may, of course, be obtained by using only a sub-set of the
items in the original test and thus does not require the writing of new test
items but just an elimination of items in the original test. This is an
easy change to make in a test and one that should be considered in those
cases in which the potential reductions in cost in testing more than offset
the potential gains from increasing the validity of the test.

There is another reason for shortening the test over and above just the
reduced cost of a shorter test. In some instances, a certain amount of
testing time may be available at a constant cost. Here, the leftover testing
time may be put to good use by introducing new tests which measure other
characteristics of the individual. This new battery of tests may be of much
greater "band width" (Cronbach & Gleser, 1965) and may greatly improve the
performance of the testing program.


From the above, it should be apparent that there must exist some tests which
can be made at the same time shorter and more valid by the changeover from
choice testing to admissible probability measurement. What characterizes
these tests? Are all tests of this sort? How large are the gains that can
be expected from changing over from choice testing to admissible probability
measurement?

These are the types of questions that will be answered here. As before,
logic and mathematics will be employed to yield very conservative estimates
of the gains resulting from the use of admissible probability measurement.
Equations will be derived which give test reliability and measurement
validity as a function of the distribution of the ability levels in the
population to be tested and of the amount and type of guessing engaged in by this
population. Then additional equations will be derived to show what happens
when guessing is eliminated by changing over to admissible probability
measurement. These equations will be solved and the results plotted over the
complete range of parameter values for the most important cases to be
encountered in practice. All of the equations hold for tests of any length.

The formal statement of the testing process is similar to that given in
Section B of the First Semi-Annual Technical Report and may be found in
Mathematical Appendix A. Briefly, the testing process is based on the
independent sampling of test items from a large pool and on the independent
sampling of persons from a population which is characterized by a specified
distribution of ability levels where ability level is defined as the
proportion of items in the pool that the individual knows.


EVERYONE  GUESSES 

Assume now that each person to be tested knows a certain proportion of the
test items in the pool. Let this proportion be represented by $p$ which can,
of course, range from zero to one. Different people will know a different
proportion of the items and thus differ in ability level with respect to
the test under consideration. This distribution of ability levels is
represented by a beta distribution defined over the interval from zero to one.
The beta distribution has two parameters, $a$ and $b$, which completely determine
its shape and location. The parameters $a$ and $b$ can take on any values
greater than zero. The mean of the beta distribution can be obtained from
the two parameters, thus $\bar{p} = a/(a+b)$. Also the variance of the beta
distribution can be written in terms of these parameters, thus

$$ \sigma_p^2 = \frac{ab}{(a+b)^2 (a+b+1)}. $$

The beta distribution is very flexible and can assume many shapes depending,
of course, upon the particular values of $a$ and $b$ selected.
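
As a quick numerical check on the two moment formulas above, the following
minimal Python sketch evaluates the mean and variance of a beta distribution
of ability levels. The function names `beta_mean` and `beta_variance` are
ours, and the parameter pairs are examples only: (1, 1) is the rectangular
case, while (3, 3) and (2, 4) are arbitrary choices, not values taken from
this report.

```python
def beta_mean(a: float, b: float) -> float:
    """Mean ability level of a Beta(a, b) population: a / (a + b)."""
    return a / (a + b)


def beta_variance(a: float, b: float) -> float:
    """Variance of ability levels: ab / ((a + b)^2 (a + b + 1))."""
    return a * b / ((a + b) ** 2 * (a + b + 1))


# Example parameter pairs (assumed for illustration only).
for a, b in [(1.0, 1.0), (3.0, 3.0), (2.0, 4.0)]:
    print(f"a={a}, b={b}: mean={beta_mean(a, b):.4f}, "
          f"variance={beta_variance(a, b):.4f}")
```

Raising both parameters together while keeping them equal leaves the mean at
1/2 and shrinks the variance, which is how a family of symmetric populations
with progressively less spread can be generated.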

For all the computations carried out below we will use one of the six
different beta distributions shown in Figure 1. Distribution A represents an
equal distribution of ability levels over the population to be tested. This
rectangular distribution is not likely to be found in practice but is of
some interest because it represents an extreme. Distribution B represents
the distribution of ability levels for a test of some considerable
difficulty, while Distribution C represents a distribution of ability levels for
a test which is rather easy. Distribution D is of considerable interest
because it is an approximation of conditions that are often found in practice.
There is a symmetric distribution of ability levels with considerable
variation over the range. Distributions E and F represent, respectively,
symmetric distributions with less variability in ability level, with
Distribution F probably representing the other practical extreme to Distribution A
in the sense that in A the ability levels are quite spread out while in F
the ability levels are clustered around the average ability level of 1/2.

Though  there  may  be  other  distributions  of  ability  level  of  interest  these 
six  represent  quite  a  few  types  and  probably  place  reasonable  practical 
bounds  on  those  that  will  be  encountered  in  practice.  The  equations  given 
below  can,  however,  be  used  to  derive  the  results  for  any  of  the  infinite 
number  of  beta  distributions  of  ability  levels,  but  we  will  solve  them  only 
for  these  six  distributions. 

Now, if a person knows the answer to a particular item he will answer it and
get it correct. If, on the other hand, he does not know the answer, we
assume for the purposes of this section that the person will guess at the
answer and that he has a probability, $\theta$, of getting the item correct by
chance. The guessing level, $\theta$, can range from zero to 1/2. If $\theta = 0$,
then no one is guessing on the test, i.e., if a person doesn't know the answer
to an item, he leaves it blank. If $\theta = 1/2$, then the maximum amount of
guessing possible is occurring. In a true-false test $\theta = 1/2$ also equals
the minimum possible amount of guessing because if there are only two possible
answers, the probability of chance success must be 1/2. However, this value of
$\theta = 1/2$ also represents a maximum amount of guessing for any other test.
For example, in a five-alternative multiple-choice test the guessing level
may be equal to 1/2 because people may have enough information to exclude
three of the five alternatives and just have to guess between the remaining
two. Likewise, in a constructed-response or fill-in-the-blank test the
guessing level may be 1/2 if people tend to think of only two possible
answers and have to guess which of the two is right. A guessing level of
$\theta = 1/5$ is probably a practical minimum in the sense that for a
five-alternative multiple-choice test it is the smallest value that $\theta$ can
assume. That is, if a person has no information with which to discriminate
between the five alternatives but must just pick one at random, then his
probability of chance success is 1/5. Note that if he does have any
information, then his probability of chance success would be much greater than
1/5. Even in constructed-response tests it is unlikely that people are able to
think of more than five possible answers and thus achieve a guessing level of
less than 1/5. So we will take $\theta = 1/5$ to be a minimum practical guessing
level and $\theta = 1/2$ to be a maximum possible guessing level, though we will
investigate test reliability and validity over the complete range from
$\theta = 0$ (representing what can be achieved by admissible probability
measurement) up to $\theta = 1/2$.
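
The guessing model just described can be written down in a few lines. The
sketch below assumes, as the text does, that a known item is always answered
correctly and an unknown item is answered correctly with probability theta;
the function name `prob_correct` and the range check are ours.

```python
def prob_correct(p: float, theta: float) -> float:
    """Chance that a person of ability p answers a random item correctly.

    With probability p the item is one the person knows (always correct);
    otherwise the person guesses and succeeds with probability theta.
    """
    if not (0.0 <= p <= 1.0 and 0.0 <= theta <= 0.5):
        raise ValueError("need 0 <= p <= 1 and 0 <= theta <= 1/2")
    return p + (1.0 - p) * theta


# Compare no guessing, the practical minimum, and the maximum level.
for theta in (0.0, 0.2, 0.5):
    row = ", ".join(f"{prob_correct(p, theta):.2f}" for p in (0.0, 0.5, 1.0))
    print(f"theta={theta}: P(correct) for p = 0, 0.5, 1 -> {row}")
```

Guessing compresses the range of expected item scores from [0, 1] to
[theta, 1], which is the mechanism by which it blurs differences between
ability levels.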

ONE-ITEM TEST RELIABILITY

Suppose that two test items were selected at random from the pool of test
items and that each of the test items is given to persons selected from a
population with distribution of ability levels characterized by the beta
distribution with parameters $a$ and $b$. If a person answers an item correctly,
he receives a score of one and if he answers incorrectly, he receives a
score of zero. Each of the two items can be considered a separate one-item
test and the correlation between these test scores would represent the
one-item test reliability. Suppose further that if a person does not know the
answer to an item, he guesses with probability, $\theta$, of success. Now from
Mathematical Appendix C we have the equation for a one-item test reliability
under the condition that everyone guesses at level $\theta$:

$$ r_{xy}(\theta \mid a,b) = \frac{(1-\theta)^2 \, \sigma_p^2}{p_\theta \, (1 - p_\theta)} \tag{1} $$

where $p_\theta = (1-\theta)\bar{p} + \theta$. Notice that this one-item test
reliability depends upon three parameters, $\theta$, $a$ and $b$. Examination of
this equation reveals unequivocally that the test reliability becomes smaller
as the variance, $\sigma_p^2$, of the distribution of ability levels becomes
smaller. In effect, this variance sets an upper limit on test reliability.

It is not so clear how guessing level $\theta$ affects test reliability since
$\theta$ appears in both the numerator and denominator of the equation. The impact
of the guessing level can best be seen by fixing the distribution of ability
levels, by specifying $a$ and $b$, and then solving the equation for different
values of $\theta$. The results of such computations are shown in Figure 2 for the
six distributions of ability level plotted in Figure 1. The curves in Figure
2 indicate clearly that increasing the guessing level decreases test
reliability. Examination of the curves for different distributions of ability
level shows that test reliability is related to the size of the variance of
the distribution of ability levels and, further, that an asymmetrical
distribution (Distributions B and C) interacts with guessing level to affect
test reliability. The easy test corresponding to Distribution C is rather
slightly affected by guessing level whereas the difficult test represented
by Distribution B is greatly affected by increasing the guessing level. All
in all, the existence of guessing significantly reduces the reliability of a
test, especially when one considers that .20 is the minimal achievable
guessing level in existing choice tests and that .50 is a very commonly
encountered guessing level.
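
The computation behind curves of this kind can be sketched as follows. This
is a minimal illustration of equation (1), not a reproduction of Figure 2:
the three parameter pairs are hypothetical stand-ins for a rectangular, a
difficult and an easy population (the parameters of Distributions A to F
appear only in Figure 1), and the function name is ours.

```python
def one_item_reliability(theta: float, a: float, b: float) -> float:
    """Equation (1): (1 - theta)^2 * var_p / (p_theta * (1 - p_theta))."""
    mean_p = a / (a + b)
    var_p = a * b / ((a + b) ** 2 * (a + b + 1))
    p_theta = (1.0 - theta) * mean_p + theta  # mean item score under guessing
    return (1.0 - theta) ** 2 * var_p / (p_theta * (1.0 - p_theta))


# Hypothetical populations (assumed parameters, for illustration only).
populations = {
    "rectangular (1,1)": (1.0, 1.0),
    "difficult   (2,4)": (2.0, 4.0),
    "easy        (4,2)": (4.0, 2.0),
}
print("theta:             " + " ".join(f"{t / 10:.1f}  " for t in range(6)))
for label, (a, b) in populations.items():
    values = [one_item_reliability(t / 10, a, b) for t in range(6)]
    print(label + "  " + " ".join(f"{v:.3f}" for v in values))
```

With these assumed parameters the difficult and easy populations have the
same variance, so any difference between their rows comes only from the
interaction of guessing with the mean ability level.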

N-ITEM TEST RELIABILITY

Assume now that a longer test containing $n$ items is formed by randomly
selecting additional items from the pool of test items so that two
equivalent $n$-item tests are obtained and given to persons from the population to
be tested. In this case, it is shown in Mathematical Appendix C that the
Spearman-Brown prophecy formula type of process can be applied to yield the
correlation between these two tests of length $n$. This is the test reliability
for an $n$-item test and it is dependent upon four parameters, $n$, $\theta$, $a$
and $b$. Thus,

$$ r_{xy}(n, \theta \mid a,b) = \frac{n \, r_{xy}(\theta \mid a,b)}{1 + (n-1) \, r_{xy}(\theta \mid a,b)} \tag{2} $$

Notice that the $n$-item test reliability is a function of the length of the
test, $n$, and of the one-item test reliability. This means, of course, that
by knowing the reliability for a one-item test we can determine the
reliability for a test of any length. Equation (2) implies that if the test is
made longer and longer ($n$ tends toward infinity), the reliability of the
lengthened test will approach one. The test can be made perfectly reliable.
This is a traditional result from test theory. If this equation were solved
and plotted as a function of $n$, it would generate a curve which increases by
smaller and smaller steps as $n$ increases and asymptotically approaches a
reliability of one as shown by two of the curves in Figure 4. This ability
to solve for the reliability of tests of any length is very important since
it allows us to infer the effect of shortening a test and to compare savings
from eliminating guessing.
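
The diminishing returns described above are easy to see numerically. The
following sketch applies equation (2) to an assumed one-item reliability of
0.05; that value and the function name `n_item_reliability` are illustrative
choices of ours, not figures from the report.

```python
def n_item_reliability(n: int, r1: float) -> float:
    """Equation (2): Spearman-Brown projection from one-item reliability r1."""
    if n < 1:
        raise ValueError("test length n must be at least 1")
    return n * r1 / (1.0 + (n - 1) * r1)


r1 = 0.05  # assumed one-item reliability, for illustration only
previous = 0.0
for n in (1, 10, 20, 50, 100, 200, 1000):
    current = n_item_reliability(n, r1)
    print(f"n={n:5d}  reliability={current:.4f}  gain={current - previous:.4f}")
    previous = current
```

Because the formula depends on the population and the guessing level only
through the one-item reliability, the same function serves for both the
guessing-contaminated and the guessing-free test.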

ONE-ITEM MEASUREMENT VALIDITY


Suppose that just one item is selected from the pool of available items and
given to a sample of persons from the population to be tested. Each person
will make a score of either zero or one point depending upon whether or not
he answered the item correctly. Each person is also characterized by
having a certain ability level, $p$, corresponding to the proportion of items
in the complete pool to which he knows the answer. Now, what is the
correlation between the test score and this ability level? This correlation would
represent the ability of the test to measure the person's ability level and
in this sense is the measurement validity of the test; it can be derived as is
shown in Mathematical Appendix D. The one-item measurement validity of a test
in which everyone is guessing at level $\theta$ is

$$ r_{xp}(\theta \mid a,b) = \frac{(1-\theta) \, \sigma_p}{\sqrt{p_\theta \, (1 - p_\theta)}} \tag{3} $$

where, as in (1), $p_\theta = (1-\theta)\bar{p} + \theta$. As in the case of test
reliability, this correlation is a function of $\theta$, $a$ and $b$, i.e., it is
affected both by the guessing level and by the parameters of
the distribution of ability levels. In fact, by comparing (3) with (1) it
may be seen that (3) is nothing more than the square root of (1). Figure 3
shows how this one-item test measurement validity is affected by different
levels of guessing for situations based on the different distributions of
ability levels. The effect is quite similar to that obtained for test
reliability and it must be so due to the direct relation between the two
correlations. Test measurement validity is degraded by the existence of guessing
and the degradation will be significant for levels usually encountered in
practice.
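
One way to gain confidence in equation (3) is to simulate the testing process
directly and compare the observed correlation with the formula. The sketch
below is our own check under stated assumptions, not part of the report's
derivation: it draws ability levels from a beta distribution, treats an item
as known with probability $p$ and otherwise guessed at level theta, and uses
an assumed population with a = b = 3. All function names are hypothetical.

```python
import math
import random


def one_item_validity(theta: float, a: float, b: float) -> float:
    """Equation (3): (1 - theta) * sd_p / sqrt(p_theta * (1 - p_theta))."""
    mean_p = a / (a + b)
    var_p = a * b / ((a + b) ** 2 * (a + b + 1))
    p_theta = (1.0 - theta) * mean_p + theta
    return (1.0 - theta) * math.sqrt(var_p) / math.sqrt(p_theta * (1.0 - p_theta))


def pearson(x, y):
    """Plain Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    sxx = sum((u - mean_x) ** 2 for u in x)
    syy = sum((v - mean_y) ** 2 for v in y)
    return sxy / math.sqrt(sxx * syy)


def simulated_validity(theta, a, b, n_people=100_000, seed=0):
    """Monte Carlo estimate of corr(one-item score, ability level)."""
    rng = random.Random(seed)
    abilities, scores = [], []
    for _ in range(n_people):
        p = rng.betavariate(a, b)                  # person's ability level
        knows = rng.random() < p                   # sampled item is known
        correct = knows or rng.random() < theta    # otherwise guess
        abilities.append(p)
        scores.append(1.0 if correct else 0.0)
    return pearson(scores, abilities)


a, b, theta = 3.0, 3.0, 0.5  # assumed values, for illustration only
print("formula   :", round(one_item_validity(theta, a, b), 3))
print("simulation:", round(simulated_validity(theta, a, b), 3))
```

The two printed values should agree to within sampling error, which shrinks
roughly as one over the square root of `n_people`.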

N-ITEM TEST MEASUREMENT VALIDITY


Now suppose that we randomly take a sample of $n$ items from the pool of items
and give it to people from the population to be tested. Consider the
correlation between the total test score and the ability level for this test. As
in the case of test reliability, the Spearman-Brown prophecy formula type of
process can be used to project the measurement validity of a test of any
length and the resulting equation as derived in Mathematical Appendix D is

$$ r_{xp}(n, \theta \mid a,b) = \frac{r_{xp}(\theta \mid a,b) \, \sqrt{n}}{\sqrt{1 + (n-1) \, r_{xy}(\theta \mid a,b)}} \tag{4} $$

Thus, from knowing the one-item test measurement validity we can obtain the
measurement validity for a test of any length. This equation is really just
the square root of (2) and thus has quite similar properties. For example,
as the sample size increases without bound the measurement validity of the
lengthened test approaches one. See Figure 4.
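
Equation (4) can be evaluated directly once the one-item quantities are
available. The sketch below is a minimal illustration: it assumes a symmetric
population with a = b = 3 (our choice, not a value stated in the text) and
compares a guessing-free test with one in which everyone guesses at level
1/2. Function names are ours.

```python
import math


def one_item_reliability(theta: float, a: float, b: float) -> float:
    """Equation (1)."""
    mean_p = a / (a + b)
    var_p = a * b / ((a + b) ** 2 * (a + b + 1))
    p_theta = (1.0 - theta) * mean_p + theta
    return (1.0 - theta) ** 2 * var_p / (p_theta * (1.0 - p_theta))


def n_item_validity(n: int, theta: float, a: float, b: float) -> float:
    """Equation (4): validity of the total score on an n-item test."""
    r_xy = one_item_reliability(theta, a, b)
    r_xp = math.sqrt(r_xy)  # equation (3) is the square root of equation (1)
    return r_xp * math.sqrt(n) / math.sqrt(1.0 + (n - 1) * r_xy)


a, b = 3.0, 3.0  # assumed population, for illustration only
for n in (1, 10, 30, 100, 1000):
    free = n_item_validity(n, 0.0, a, b)
    guess = n_item_validity(n, 0.5, a, b)
    print(f"n={n:5d}  validity without guessing={free:.3f}  with theta=1/2={guess:.3f}")
```

Both columns climb toward one as $n$ grows, but the guessing-free column gets
there with far fewer items, which is the basis for the test-shortening
argument that follows.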

MAXIMUM  REDUCTION  IN  TEST  LENGTH  POSSIBLE  WITHOUT  REDUCING  EITHER  TEST 
RELIABILITY  OR  MEASUREMENT  VALIDITY. 


Since the foregoing equations imply that the existence of guessing reduces
both test reliability and measurement validity and that both of these
quantities are a function of the length of the test, several interesting results
may be deduced. First, and most obviously, if admissible probability
measurement were used to eliminate guessing in an existing test then the
resulting total test score would be both more reliable and more valid. Second,
though the elimination of items from this new test would decrease both test
reliability and measurement validity, there is generally a range of reduced
test lengths over which the guessing-free test will be both more reliable
and more valid than the longer guessing-contaminated test. There will, of
course, also be a range of reduced test lengths over which the guessing-free
test will be less reliable and valid than the much longer guessing-contaminated
test. And there will be one unique reduced test length at which the
guessing-free test will have essentially the same reliability and validity as the
much longer guessing-contaminated test. It is shown in Mathematical
Appendix E that this reduced test length, $n_0$, at which the reliability and
validity of the guessing-free test exactly matches that of the original
guessing-contaminated test can be obtained by solving

$$ n_0 = \frac{n_\theta \, r_{xy}(\theta \mid a,b) \, [1 - r_{xy}(0 \mid a,b)]}{r_{xy}(0 \mid a,b) \, [1 - r_{xy}(\theta \mid a,b)]} \tag{5} $$

where $n_\theta$ is the length of the original guessing-contaminated test with
one-item test reliability $r_{xy}(\theta \mid a,b)$ while the one-item test
reliability of the guessing-free test is $r_{xy}(0 \mid a,b)$.

The value $n_\theta - n_0$ is typically a maximum possible reduction in test length
in the sense that very seldom would one want to reduce the test reliability
and measurement validity below that yielded by the original
guessing-contaminated test. Any test length smaller than $n_0$ would yield a test
reliability and measurement validity smaller than that of the original test
while a test length greater than $n_0$ would yield a test reliability and
measurement validity larger than that of the original test.

Examination of (5) clearly indicates that the maximum reduction in test length
possible is a function of the length of the original test $n_\theta$. The
dependence of $n_0$ on the guessing level, $\theta$, occurring in the original test
and upon the distribution of ability levels as represented by $a$ and $b$ is not
clear from examination of (5). Therefore, we have set $n_\theta = 100$, a fairly
typical length for a test used for personnel decisions, and have solved (5) for
different values of $\theta$ for each of the six distributions of ability levels
considered in this report. The results are plotted in Figure 5. Remember that
D represents a rather classical distribution of ability levels. In this case,
if the guessing level in the original 100-item test were at the minimal value
of 1/5 then changing to admissible probability measurement could result in
reduction of the length of the test to about a 63-item test, while if
guessing were occurring at the maximum level of 1/2, then the admissible
probability test could be made as short as 30 items. Somewhat greater savings
result in the case of a more difficult test as indicated by Curve B while
slightly smaller savings result in the case of easier tests as indicated by
Curve C.

The reduction in test length indicated by these curves is not insignificant.
The savings are of such magnitude that using admissible probability
measurement to eliminate guessing means that two or three tests can now be
given, increasing the "bandwidth" of the testing program without any increase
in total testing time. The increased "bandwidth" could yield a very great
improvement in the overall testing process, as argued by Cronbach & Gleser
(1965).

It should be understood that the use of admissible probability measurement
does not require reduction in the length of the test to exactly the amount
indicated in these figures. If the reduced test is shorter than that indicated
on the curves, then it will be less reliable and valid than the original test,
while if it is longer it will be more reliable and valid than the original
guessing-contaminated test. The optimal length of the new test should be
determined by careful comparison of the value of increased validity with the
value of reduced test length, possibly to increase the "bandwidth" of the
testing program.


SOME  PERSONS  GUESS,  OTHERS  DON'T 


TEST RELIABILITY


The situation analyzed above is realistic for many applications of testing
but not for all. To be more explicit, suppose that some of the testing
population were test-wise and would guess whenever they did not know the
answer to a test item, while others in the population to be tested were not
test-wise and invariably would choose to skip an item rather than to guess at
its answer. This situation is sometimes encountered in testing programs. It
is not covered by the analysis given above, since there it was assumed that
everyone guesses. Here we will assume that a certain proportion, q, of the
population to be tested will guess at level θ and that the rest of the testing
population, represented by the proportion 1-q, will never guess on the test.
This is such a basic change in the description of the testing process that it
is quite possible that it will make significant changes in our conclusions.
As shown in Mathematical Appendix C, the one-item test reliability when a
proportion q of the population to be tested guesses at level θ is:


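\[
r_{xy}(q,\theta|a,b) = \frac{P_{22} - P_{q\theta}^{2}}{P_{q\theta}\,(1 - P_{q\theta})} \tag{6}
\]

where

\[
P_{q\theta} = q[(1-\theta)\mu + \theta] + (1-q)\mu ,
\]

\[
P_{22} = q[(1-\theta)^{2}\mu_{2} + 2\theta(1-\theta)\mu + \theta^{2}] + (1-q)\mu_{2} ,
\]

and

\[
\mu = \frac{a}{a+b}, \qquad \mu_{2} = \frac{a(a+1)}{(a+b)(a+b+1)} .
\]

The n-item test reliability is

\[
r_{xy}(n,q,\theta|a,b) = \frac{n\, r_{xy}(q,\theta|a,b)}{1 + (n-1)\, r_{xy}(q,\theta|a,b)} \tag{7}
\]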

Equation (6) is too complicated to tell by inspection how the different
parameters q, θ, a and b affect test reliability. However, by solving the
equation for the six distributions of ability levels considered here, we
can gain some idea as to how ability level, guessing level and the proportion
guessing affect test reliability.


If 1/2 of the population to be tested guesses at level θ while the others in
the population do not guess, then test reliability varies as a function of
guessing level as shown in Figure 6. These results are quite different from
those obtained and graphed in Figure 2. In Figure 2, increased guessing
always lowers test reliability. In contrast to that consistent and neat
result, we now find that increased guessing can serve to increase test
reliability rather than to reduce it.


Does this strange result of guessing increasing test reliability hold only
when 1/2 of the tested population guesses, or does it hold for other
situations too? Figure 7 examines test reliability for the minimal guessing
level of θ = 1/5 for all possible proportions q, all the way from the extreme
case in which no one in the tested population guesses to the case in which
everyone in the tested population guesses. Figure 8 shows the same analysis
for the case in which all guessing is done at the maximal level, θ = 1/2.
Examination of these two figures shows that the phenomenon is not unique
to q = 1/2.
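The question of which proportions q raise reliability above the guessing-free value can be explored with a small sweep over q. The following is only a sketch: the helper name proportions_raising_reliability is hypothetical, the grid of ten values of q is arbitrary, and the distribution a = b = 3 in the test cases is an assumed example.

```python
def one_item_reliability(q, theta, a, b):
    """Equation (6): one-item reliability when a proportion q guesses at level theta."""
    mu = a / (a + b)
    mu2 = a * (a + 1) / ((a + b) * (a + b + 1))
    p_q = q * ((1 - theta) * mu + theta) + (1 - q) * mu
    p22 = (q * ((1 - theta) ** 2 * mu2 + 2 * theta * (1 - theta) * mu + theta ** 2)
           + (1 - q) * mu2)
    return (p22 - p_q ** 2) / (p_q * (1 - p_q))


def proportions_raising_reliability(theta, a, b, steps=10):
    """Grid values of q whose reliability exceeds the no-guessing value (q = 0)."""
    baseline = one_item_reliability(0.0, theta, a, b)
    grid = [k / steps for k in range(1, steps + 1)]
    return [q for q in grid if one_item_reliability(q, theta, a, b) > baseline]


print(proportions_raising_reliability(0.5, 3, 3))
print(proportions_raising_reliability(0.2, 3, 3))
```

For the assumed distribution, maximal guessing raises reliability for the smaller proportions on the grid, while minimal guessing does not raise it for any of them. Other distributions of ability level will behave differently, which is why the figures show all six.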

These results certainly cast doubt upon the gains to be expected from
changing over to admissible probability measurement. To be more specific,
if the testing situation is one in which some of the people guess while
others don't, then changing to admissible probability measurement to
eliminate guessing may result in a test of lesser reliability than that of the
original test. In such cases, a lengthening of the test may be required to
yield the same reliability as that of the original test. Whether a gain or
a loss is realized from the changeover to admissible probability measurement
depends very critically upon the combination of parameter values appropriate
to the test situation. Taken together, the results of this new analysis
suggest that changing over to admissible probability measurement may yield
only slight benefits and in many cases will actually impair the reliability
of the test.

Before concluding, however, that changing over to admissible probability
measurement holds little promise for improving testing, it might be
worthwhile to take a look at the measurement validity of a test used with a
population where some persons guess while others don't.

TEST  MEASUREMENT  VALIDITY 

As shown in Mathematical Appendix D, the one-item measurement validity for
such a test is

\[
r_{xp}(q,\theta|a,b) = \frac{(1 - q\theta)\,\sigma_{p}}{\sqrt{P_{q\theta}\,(1 - P_{q\theta})}} \tag{8}
\]

The n-item measurement validity is

\[
r_{xp}(n,q,\theta|a,b) = \frac{\sqrt{n}\; r_{xp}(q,\theta|a,b)}{\sqrt{1 + (n-1)\, r_{xy}(q,\theta|a,b)}} \tag{9}
\]

Figure 9 shows the one-item measurement validity for each of the six
distributions of ability level when half the persons guess. Notice that in
this case, increasing the guessing level results in decreased test validity.

Figure 10 shows one-item measurement validity when a proportion q of the
tested population are guessing at a minimal guessing level θ = 1/5. Here
notice that increasing the proportion of people guessing decreases test
validity. Figure 11 shows the corresponding results when those persons
guessing are guessing at a maximum guessing level θ = 1/2. Again, increasing
the proportion of persons guessing decreases test validity. These results are
more in accord with what was found before. The results shown in these
figures and the results of other computations indicate that guessing in any
amount done by any proportion of the tested population can never increase
test measurement validity. This result agrees with intuition much better
than the result concerning test reliability.


Look at (8) and notice that the test measurement validity is no longer equal
to the square root of the test reliability. This breakdown of the relation
between test reliability and validity has a very important implication which
can be seen by examining (9). Now, the measurement validity of the test,
when tests of length greater than one are considered, depends both upon the
one-item test measurement validity and upon the one-item test reliability,
which can no longer be expressed in terms of one another. In testing
situations where everyone guesses we found that lengthening the test
indefinitely made both test reliability and measurement validity approach the
maximum possible value of one, that is, resulted in a completely reliable and
completely valid test. In this new situation where some guess while others
don't, increasing the length of the test without bound results in the test
reliability approaching the maximum possible value of one, but the test
measurement validity may approach some other value less than one. See Figure ^
for an illustration of this. That is, depending upon the circumstances, it
may be impossible to obtain a completely valid test. The fact that some
persons guess while others don't sets an absolute upper limit on the
measurement validity that can be yielded by the test no matter how long it is.
This upper limit on test measurement validity is


\[
r_{xp}(\infty,q,\theta|a,b) = \frac{r_{xp}(q,\theta|a,b)}{\sqrt{r_{xy}(q,\theta|a,b)}} \tag{10}
\]

The maximum percent of true variance which can be accounted for by a test of
infinite length is given by the square of (10). This percent of true
variance accounted for is considered a better measure of test performance
than test measurement validity.
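The square of (10) is simply the ratio of the squared one-item validity (8) to the one-item reliability (6), so the ceiling on true variance accounted for can be computed directly. The sketch below is illustrative only; the function name is hypothetical and a = b = 3 is an assumed ability distribution.

```python
def max_true_variance(q, theta, a, b):
    """Square of equation (10): share of true variance a test of infinite length can reach."""
    mu = a / (a + b)
    mu2 = a * (a + 1) / ((a + b) * (a + b + 1))
    var_p = mu2 - mu ** 2                             # variance of ability level
    p_q = q * ((1 - theta) * mu + theta) + (1 - q) * mu
    p22 = (q * ((1 - theta) ** 2 * mu2 + 2 * theta * (1 - theta) * mu + theta ** 2)
           + (1 - q) * mu2)
    r_xy = (p22 - p_q ** 2) / (p_q * (1 - p_q))       # equation (6)
    r_xp_sq = (1 - q * theta) ** 2 * var_p / (p_q * (1 - p_q))   # square of (8)
    return r_xp_sq / r_xy


for q, theta in ((0.5, 0.2), (0.5, 0.5), (1.0, 0.5)):
    print(q, theta, round(max_true_variance(q, theta, 3, 3), 3))
```

For the assumed distribution the ceiling is about 91 percent when half the population guesses at θ = 1/5 and about 53 percent at θ = 1/2, and it returns to 100 percent when everyone guesses.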


Figure 12 shows the upper limit in the percent of true variance accounted
for, even if the test is of infinite length, for the case in which 1/2 of the
persons guess at level θ, for each of our six distributions of ability level.
These curves indicate that any amount of guessing degrades the performance
of the test. This degradation becomes of some significance for even a minimal
guessing level of θ = 1/5 and increases much more by the time the guessing
level reaches a maximum of θ = 1/2. Thus, if some persons in the tested
population guess while others don't, the use of a conventional choice method
which allows for guessing means that there will be a barrier to the maximum
performance that can be realized by the test. This barrier cannot be
breached as long as we continue to use conventional choice methods for test
administration.

Figures 13 and 14 show the upper limit of test performance when those
persons guessing guess at the minimum level and at the maximum level. These
figures make it quite clear that test performance is degraded whenever one
encounters a mixed population where some persons guess and others don't.
It is much better either to have everyone guessing or no one guessing, and
these are the only two situations that eliminate the barrier on maximum test
performance.

MAXIMUM  REDUCTION  IN  TEST  LENGTH  POSSIBLE  WITHOUT  REDUCING  MEASUREMENT  VALIDITY 

So  far  the  results  apply  only  to  tests  of  infinite  length.  What  happens  when 
we  consider  the  more  realistic  case  of  using  a  test  of  finite  length?  In 
particular,  we  can  consider  the  maximum  reduction  in  test  length  possible  by 
changing  over  to  admissible  probability  measurement  to  eliminate  guessing. 

In this new situation, we will of course get different results depending upon
whether we equalize reliabilities or measurement validities. Since
measurement validity, not reliability, is the real measure of test
performance, we need concern ourselves only with measurement validity. If the
guessing-free test has the same or greater validity, we don't really care that
it has less reliability than the original guessing-contaminated test. This may
seem counter-intuitive, but the next major section below may provide some
understanding of why we shouldn't pay too much attention to test reliability
when the test is affected by guessing. As before, we can solve for the
reduced test length, n_0, at which the validity of the guessing-free test
exactly matches that of the original guessing-contaminated test, as shown in
Mathematical Appendix E:

\[
n_0 = \frac{n_{q\theta}\, r_{xp}^{2}(q,\theta|a,b)\,[1 - r_{xy}(0|a,b)]}{r_{xy}(0|a,b)\,\{1 - r_{xy}(q,\theta|a,b) + n_{q\theta}\,[r_{xy}(q,\theta|a,b) - r_{xp}^{2}(q,\theta|a,b)]\}} \tag{11}
\]

where r_xy(0|a,b) is the one-item reliability of the guessing-free version
of the test.
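Equation (11) can be checked numerically in the same way as (5). The sketch below is a minimal transcription under stated assumptions: the function name is hypothetical, and the distribution a = b = 3 is an assumed stand-in for the ability distribution discussed in the text.

```python
def reduced_length_mixed(n, q, theta, a, b):
    """Equation (11): guessing-free length with the validity of an n-item choice test."""
    mu = a / (a + b)
    mu2 = a * (a + 1) / ((a + b) * (a + b + 1))
    var_p = mu2 - mu ** 2
    p_q = q * ((1 - theta) * mu + theta) + (1 - q) * mu
    p22 = (q * ((1 - theta) ** 2 * mu2 + 2 * theta * (1 - theta) * mu + theta ** 2)
           + (1 - q) * mu2)
    r_xy = (p22 - p_q ** 2) / (p_q * (1 - p_q))                  # equation (6)
    r_xp_sq = (1 - q * theta) ** 2 * var_p / (p_q * (1 - p_q))   # square of (8)
    r_zero = var_p / (mu * (1 - mu))                             # guessing-free reliability
    return n * r_xp_sq * (1 - r_zero) / (r_zero * (1 - r_xy + n * (r_xy - r_xp_sq)))


for q, theta in ((1.0, 0.2), (0.5, 0.2), (1.0, 0.5), (0.5, 0.5)):
    print(q, theta, round(reduced_length_mixed(100, q, theta, 3, 3), 1))
```

When q = 1 the squared validity equals the reliability, the bracketed term multiplying n vanishes, and (11) reduces to (5). For the assumed distribution a 100-item test shrinks to roughly 35 items when half the population guesses at θ = 1/5 and to roughly 6 items when half guesses at θ = 1/2.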

This reduced test length, n_0, is shown for a 100-item test given to a
population with Distribution D of ability levels, some of whom guess, at the
minimum level in Figure 15 and at the maximum level in Figure 16. Test
reliabilities and measurement validities are also shown on these graphs. The
measurement validities for the new guessing-free test and for the old
guessing-contaminated test will of course be the same. The test reliabilities,
however, are different, with the new guessing-free test having somewhat less
reliability than the old guessing-contaminated test of greater length.

As to the reduced test length possible, the effect of different guessing
strategies is quite dramatic. In the case of minimal guessing, if everyone
guesses, the new test can be reduced to about 63 items. If, however, about
half of the people guess while the others don't, then the new test can be
reduced to about 34 items. In the case of maximal guessing, if everyone
guesses, the new test can be reduced to about 30 items, while if about 1/2
the people guess while the others don't, the new test can be reduced to
about 5 or 6 items. The existence of differences in guessing strategy in
the tested population can greatly degrade test performance and, conversely,
can mean that even greater benefits will be yielded by changing over to
admissible probability measurement. Reducing the length of a test to 1/20th
of its original length or increasing its validity from the high 60's into
the high 90's merely by changing the method of test administration is not a
trivial benefit.


The same computations have been performed for the other five of the six
distributions of ability level. The results are not too different, with
greater gains in some cases and somewhat smaller gains in others.

TEST  RELIABILITY  WHEN  MEASUREMENT  VALIDITY  IS  ZERO. 


It may be instructive to investigate what happens to test reliability when
there is no validity whatsoever to the test. By zero measurement validity
we mean that the probability of a person from the tested population knowing
an item varies independently from item to item. His ability level is an
independent random variable distributed according to the distribution of
ability levels. Thus, learning whether a person knows a given item tells
you nothing about whether or not he knows the answer to another item. There
is no validity whatsoever in such a test.


Now if none of the persons in the tested population were guessing, the test
would have no reliability whatsoever. If, however, a proportion q of people
consistently guess when they do not know the answer while the rest of the
persons refuse to guess, would the reliability still be zero for such a
situation? As shown in Mathematical Appendix E, the reliability for such a
test with zero validity, but where a proportion q of persons with a
distribution of ability levels of mean μ are guessing at level θ, is


\[
r_{xy}(q,\theta|\mu) = \frac{q(1-q)\,\theta^{2}(1-\mu)^{2}}{P_{2}\,(1 - P_{2})} \tag{12}
\]

where P_2 = qP_θ + (1-q)μ and P_θ = (1-θ)μ + θ.

This expression is not always equal to zero. In fact, it is generally
nonzero if there is any guessing whatsoever. Figure 17 shows test
reliabilities for a 100-item test as a function of the various levels of
guessing. The reliabilities can become quite large even if there is a minimal
amount of guessing. Figures 18 and 19 show the same thing for minimal and
maximal levels of guessing as the proportion of people adopting a guessing
strategy varies. Maximum reliability is observed when about 1/2 the persons
guess and 1/2 do not.
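The size of this spurious reliability is easy to compute from (12). In the sketch below the 100-item value is obtained by applying the lengthening formula (7) to the one-item value, which is an assumption about how the 100-item curves were produced; the function names are illustrative and the mean ability μ = 0.5 is an assumed example.

```python
def zero_validity_reliability(q, theta, mu):
    """Equation (12): one-item reliability produced by guessing differences alone."""
    p_theta = (1 - theta) * mu + theta        # Pr(correct) for a person who guesses
    p2 = q * p_theta + (1 - q) * mu           # Pr(correct) over the whole population
    return q * (1 - q) * theta ** 2 * (1 - mu) ** 2 / (p2 * (1 - p2))


def lengthen(r, n):
    """Equation (7) applied to a one-item reliability r."""
    return n * r / (1 + (n - 1) * r)


# Half the population guessing, assumed mean ability 0.5, 100-item test.
for theta in (0.2, 0.5):
    r_one = zero_validity_reliability(0.5, theta, 0.5)
    print(theta, round(r_one, 4), round(lengthen(r_one, 100), 3))
```

Under these assumptions the one-item reliabilities are only about 0.01 and 0.07, yet the 100-item reliabilities are about 0.51 and 0.88, even though the test has no measurement validity at all.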


Differences in guessing strategies among people can contribute to the
reliability of a choice test. In one sense this reliability is completely
spurious in that the test has no measurement validity. In another sense,
though, the reliability is not spurious. What is causing this reliability
is the differences in guessing strategy within the tested population. This is
also what the test is measuring, if it is measuring anything. More
specifically, the test is measuring the test-wiseness of the tested
population. This test-wiseness would also enter into the behavior of the
tested population on any other choice test that they might take. Therefore
the correlation between any two choice tests would be positive and there
would be at least apparent validity between any two choice tests. In some
cases, this can be a real validity, though it is usually not realized by the
user of the test information. These tests are measuring how test-wise people
are. This test-wiseness can in turn reflect how much experience persons have
had taking tests, which in turn can reflect their level of educational
attainment, socio-economic background, race and various other factors. So, in
this sense, these tests have some validity. However, it would seem to be
much more efficient to just have a person fill out a questionnaire directly,
stating his level of educational attainment, socio-economic background, race,
etc., rather than having this be indicated indirectly through spurious and
misleading test results.

It should be realized that if admissible probability measurement were used to
administer such a test, guessing would be completely eliminated and zero
test reliabilities and measurement validities would be obtained. The test
would be shown to be worthless as a test. One wonders how much of the
reliability and validity of certain widely used tests is of just this sort.
Now, for the first time, we can hope to find out.

SUMMARY  AND  CONCLUSIONS 

Logic and mathematics have been used to deduce the effect of guessing upon
test reliability and measurement validity. Analysis shows choice testing
to be highly sensitive to the degrading effects of guessing. In fact, the
extent of degradation is so unexpectedly large as to be almost unbelievable
when viewed in light of the current consensus that the existence of
guessing has near trivial effects on the performance of objective and
semi-objective tests. It does appear, however, that the logic is inescapable
and is in accord with current and meaningful usage in test theory.

Fortunately we are now in a position to resolve this paradox. The logic
of decision-theoretic psychometrics promises that we can eliminate the
effects of guessing from any test simply by changing over from a choice
method to admissible probability measurement in the administration of the
test. Then, by carrying out the standard psychometric analyses, one can
empirically determine the extent to which the original choice test was
affected by guessing and the benefits resulting from the changeover to
admissible probability measurement.

What is the likely result of such empirical studies? Will admissible
probability measurement yield the rather overwhelming benefits indicated by
the foregoing analysis? The major hazard to confirmation may be that the
mathematics is not complex enough to adequately represent the testing
situation. More specifically, the mathematics is based upon the assumption
that everybody guesses at the same level. This is in general not true, since
items may vary in the potency of their misleads and individuals may vary in
the extent to which they have partial knowledge of the subject. At
considerable cost the mathematics may be generalized to allow for a variable
guessing level. It appears that such a generalization would yield rather
well-behaved equations which are affected by an average guessing level. Thus,
the results would fall somewhere between those for the minimal and maximal
guessing levels encountered. While this means that a maximal benefit, say
reducing the length of the test to 1/20th of the original size, will not be
obtained in practice, neither will a minimal benefit of reducing the length of
the test by 1/3; rather, something in between will be obtained, say, reducing
the length of the test by a factor of four or five. Such gains still remain
of considerable practical importance for many testing applications.

Another respect in which the mathematics may be questioned is that it is
based on a single ability level which is constant from item to item. However,
whether this is a constant ability level or an average ability level makes
no difference to the analysis as long as one is concerned with total
test score. On the other hand, if one is concerned with item analysis,
this assumption does make a difference and should be considered.

If it is considered, it can work in favor of changing over to admissible
probability measurement. It does so in this way: Some of the analysis
given above was concerned with reducing the length of the test to trade off
increased reliability and validity against reduced testing time and greater
"bandwidth" of the testing battery. These parts of the analysis did not
discriminate between test items, in the sense that they assumed either that
all the items have the same characteristics or that a random sample of items
would be selected from the original test. In practice, this restriction
does not have to be maintained. To be explicit, one can look at the item
characteristics in terms of the data obtained from admissible probability
measurement and select the best possible sub-set of items, say the sub-set
of items that yields maximum test validity. Such an approach, based on an
analysis of the structure of the test items, will in general mean even
greater reductions in the length of the test resulting from the conversion
to admissible probability measurement. For example, in a test in which the
items are completely redundant, such an analysis would indicate that the
test can now be reduced to a one-item test and have the same validity as
the original test of any length. This is admittedly an extreme case, but it
shows what is possible by analyzing the structure of test items and
selectively choosing the items to be included in the revised guessing-free
test.

So, all in all, it appears that the existence of guessing has seriously
degraded the performance of objective and semi-objective test programs. A
new method of test administration, admissible probability measurement,
promises to eliminate the effect of guessing. This holds out the hope
that testing programs may be greatly improved and that new and exciting
uses may be found which will greatly enlarge the scope of application of
objective and semi-objective tests.

Figure 3

TEST MEASUREMENT VALIDITY (ONE-ITEM)
WHEN EVERYBODY GUESSES AT LEVEL θ

Figure 5

REDUCED TEST LENGTH POSSIBLE
WHEN EVERYBODY GUESSES AT LEVEL θ

Figure 11

TEST MEASUREMENT VALIDITY (ONE-ITEM) WHEN PROPORTION q
OF PERSONS GUESS AT LEVEL θ = 1/2

Figure 12

MAXIMUM PERCENT OF TRUE VARIANCE
ACCOUNTED FOR (TEST OF INFINITE LENGTH)
WHEN PROPORTION q = 1/2 OF PERSONS GUESS AT LEVEL θ

REDUCED TEST LENGTH POSSIBLE FOR DISTRIBUTION D
WHEN PROPORTION q OF PERSONS GUESS AT LEVEL 0=|

Figure 17

100-ITEM TEST RELIABILITY WHEN PROPORTION
q = 1/2 OF PERSONS GUESS AT LEVEL θ FOR A
TEST WITH ZERO MEASUREMENT VALIDITY

Figure 19

100-ITEM TEST RELIABILITY WHEN PROPORTION q
OF PERSONS GUESS AT LEVEL θ = 1/2 FOR A
TEST WITH ZERO MEASUREMENT VALIDITY
MATHEMATICAL APPENDICES


A.  FORMAL  STATEMENT  OF  THE  TESTING  PROCESS 

Define: 

q  =  proportion  of  persons  who  guess 

θ = probability of success if person does guess

p = proportion of items in pool that person knows

X = a random variable yielding "1" if person answers item correctly,
"0" otherwise.

Y = a random variable yielding "1" if same person answers another
item correctly, "0" otherwise.

Z = a random variable, independent of p, yielding "1" if person
guesses, "0" otherwise.

Test items are randomly and independently selected from a large pool of
items.

Persons are randomly selected from the relevant population for
administration of tests.

B.  BETA  DISTRIBUTIONS  OF  ABILITY  LEVEL 

Since p is the proportion of items in the test pool that a given individual
knows, it can serve as a comprehensive measure of his level of ability
or achievement. It is, at least, superior to the notion of true score or
ability level of classical test theory based upon the number of items
answered correctly, which, of course, would be contaminated by the effects of
guessing.

Since p has a finite range from zero to one, it is not natural to use a
normal distribution to represent the frequency of occurrence of p in a
population. The most commonly used and flexible distribution for a variable
defined over the interval [0,1] is the beta distribution:

\[
f_{\beta}(p|a,b) = \frac{1}{B(a,b)}\; p^{a-1}(1-p)^{b-1}
\]

where a,b > 0.
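For readers who wish to tabulate the density or its first two moments, the following Python sketch uses only the standard library. The function names are illustrative, and the values a = b = 3 in the example are an assumed distribution, not one taken from the report.

```python
import math


def beta_pdf(p, a, b):
    """Beta density f(p|a,b) for 0 < p < 1, computed on the log scale."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)   # log B(a,b)
    return math.exp((a - 1) * math.log(p) + (b - 1) * math.log(1 - p) - log_norm)


def beta_moments(a, b):
    """Return (mu, mu2, var): E(p), E(p^2) and the variance of p."""
    mu = a / (a + b)
    mu2 = a * (a + 1) / ((a + b) * (a + b + 1))
    return mu, mu2, mu2 - mu ** 2


print(beta_pdf(0.5, 3, 3))
print(beta_moments(3, 3))
```

The mean and second moment returned here are the quantities μ and μ_2 that appear throughout Appendices C and D.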


C.  TEST  RELIABILITY 


One-item test reliability. We need to know the joint and marginal
probabilities for the random variables X and Y. Thus,

                     X
               "0"     "1"
     Y  "0"    P11     P12     P1.
        "1"    P21     P22     P2.
               P.1     P.2     1

Decomposing the joint probabilities:

\[
P_{ij} = \Pr(X=i,\, Y=j,\, Z=1) + \Pr(X=i,\, Y=j,\, Z=0)
       = \Pr(X=i,\, Y=j \mid Z=1)\,q + \Pr(X=i,\, Y=j \mid Z=0)\,(1-q).
\]

By further decomposition and by integrating f_β(p|a,b) over p we obtain

\[
P_{22} = q[(1-\theta)^{2}\mu_{2} + 2\theta(1-\theta)\mu + \theta^{2}] + (1-q)\mu_{2} \tag{C.1}
\]

\[
P_{2} = q[(1-\theta)\mu + \theta] + (1-q)\mu . \tag{C.2}
\]

Since

\[
r_{xy} = \frac{P_{22} - P_{2}^{2}}{P_{2}\,(1 - P_{2})} \tag{C.3}
\]

we obtain (6) by substituting (C.1) and (C.2) into (C.3) and renaming
P_2 as P_qθ, and we obtain (1) by setting q = 1 and simplifying.
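The result of this derivation can be checked against a direct simulation of the testing process of Appendix A. The sketch below is a Monte Carlo illustration only: the function name, sample size and seed are arbitrary choices, and a = b = 3 is an assumed ability distribution. It draws a person's ability p, decides whether that person is a guesser, scores two independent items, and estimates the correlation between the two item scores.

```python
import random


def simulate_reliability(q, theta, a, b, persons=100_000, seed=0):
    """Estimate the one-item reliability r_xy by simulating X and Y for many persons."""
    rng = random.Random(seed)
    sum_x = sum_y = sum_xy = 0
    for _ in range(persons):
        p = rng.betavariate(a, b)                 # ability level of this person
        guesses = rng.random() < q                # Z = 1 with probability q
        p_correct = p + (1 - p) * theta if guesses else p
        x = rng.random() < p_correct              # score on the first item
        y = rng.random() < p_correct              # score on a second, independent item
        sum_x += x
        sum_y += y
        sum_xy += int(x and y)
    px, py, pxy = sum_x / persons, sum_y / persons, sum_xy / persons
    return (pxy - px * py) / ((px * (1 - px)) * (py * (1 - py))) ** 0.5


print(simulate_reliability(0.5, 0.5, 3, 3))
```

For q = 1/2, θ = 1/2 and the assumed distribution, equation (6) gives about 0.162, and the simulated correlation should agree with it to within sampling error.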

n-item test reliability. Let X^(n) and Y^(n) represent two n-item tests
obtained by sampling from the large pool of items. Then

\[
\sigma^{2}_{X^{(n)}+Y^{(n)}} = \sigma^{2}_{X^{(n)}} + \sigma^{2}_{Y^{(n)}} + 2\, r_{X^{(n)}Y^{(n)}}\, \sigma_{X^{(n)}}\, \sigma_{Y^{(n)}}
\]

and

\[
r_{X^{(n)}Y^{(n)}} = \frac{\sigma^{2}_{X^{(n)}+Y^{(n)}} - \sigma^{2}_{X^{(n)}} - \sigma^{2}_{Y^{(n)}}}{2\, \sigma_{X^{(n)}}\, \sigma_{Y^{(n)}}} \tag{C.4}
\]

Upon expanding the right hand side of (C.4) down to the level of individual
items, realizing that all item variances are equal due to the nature
of the basic testing process and that all item inter-correlations are equal
to r_xy, (C.4) through simplification becomes either (2) or (7) depending
upon the definition of r_xy.

D.  MEASUREMENT  VALIDITY 


After integration over p, we obtain:

\[
E(Xp) = q[(1-\theta)\mu_{2} + \theta\mu] + (1-q)\mu_{2} \tag{D.3}
\]

\[
E(X) = q[(1-\theta)\mu + \theta] + (1-q)\mu \tag{D.4}
\]

\[
E(p) = \mu . \tag{D.5}
\]

Substituting (D.3), (D.4), and (D.5) in (D.2) and simplifying we obtain:

\[
\mathrm{Cov}(Xp) = (1 - q\theta)\,\sigma_{p}^{2} . \tag{D.6}
\]
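The simplification leading to (D.6) can be written out step by step. The lines below assume that (D.2) is the usual definition of covariance, Cov(Xp) = E(Xp) - E(X)E(p), which is the form used in Appendix E, and that σ_p² denotes μ_2 - μ².

```latex
\begin{align*}
\mathrm{Cov}(Xp) &= E(Xp) - E(X)\,E(p) \\
  &= q[(1-\theta)\mu_{2} + \theta\mu] + (1-q)\mu_{2}
     - \mu\,\{\,q[(1-\theta)\mu + \theta] + (1-q)\mu\,\} \\
  &= (1-q\theta)\mu_{2} + q\theta\mu - (1-q\theta)\mu^{2} - q\theta\mu \\
  &= (1-q\theta)\,(\mu_{2} - \mu^{2}) \\
  &= (1-q\theta)\,\sigma_{p}^{2}.
\end{align*}
```

The term qθμ cancels, so guessing shrinks the covariance between item score and ability by the factor (1 - qθ) without changing its sign.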


By definition,

\[
\mathrm{Var}(X) = E(X^{2}) - [E(X)]^{2} \tag{D.7}
\]

and

\[
\mathrm{Var}(p) = E(p^{2}) - [E(p)]^{2} . \tag{D.8}
\]

Now,

\[
E(X) = E(X^{2}) = q[(1-\theta)\mu + \theta] + (1-q)\mu , \tag{D.9}
\]

\[
E(p) = \mu , \tag{D.10}
\]

and

\[
E(p^{2}) = \mu_{2} . \tag{D.11}
\]

Substituting (D.9) into (D.7) and (D.10) and (D.11) into (D.8) and the
results along with (D.2) into (D.1) and simplifying, we obtain

\[
r_{xp}(q,\theta|a,b) = \frac{(1 - q\theta)\,\sigma_{p}^{2}}{\sqrt{P_{q\theta}\,(1 - P_{q\theta})}\;\sigma_{p}}
\]

which simplifies to (8) and, upon setting q = 1, to (3).


h- i  tem  measurement  validity.  As  before,  let  X  n  represent  an  n-item  test 
obtained  by  sampling  from  the  large  pool  of  items  and  let  p,  of  course, 
represent  ability  level.  Then, 


and 
(D. 12) 


X«"»*  P 


r(n) 


+  a  +  2r 
P 


x(n)P  ’x(n)’p 


a 


2 

P 


2  a 

X 


(n)°p 


Upon expanding the right hand side of (D.12) down to the level of individual
items and realizing that all item variances, $\sigma_x^2$; item
inter-correlations, $r_{xy}$; and item validities, $r_{xp}$, are equal, (D.12) becomes

$r_{x(n)p} = \dfrac{2n\,r_{xp}\,\sigma_x\,\sigma_p}{2\sqrt{n}\,\sigma_x\,\sigma_p\,\sqrt{1 + (n-1)\,r_{xx}}}$

which simplifies to

(D.13)  $r_{x(n)p} = \dfrac{\sqrt{n}\,r_{xp}}{\sqrt{1 + (n-1)\,r_{xx}}}$


This final equation (D.13) then becomes either (4) or (9) depending upon
the identification of $r_{xp}$ and $r_{xx}$.
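
Equation (D.13) is straightforward to evaluate numerically. The sketch below treats the item validity and the item inter-correlation as given inputs; the function name and the example values are hypothetical and are not taken from the paper.

```python
import math


def n_item_validity(n, r_xp, r_xx):
    """(D.13): validity of an n-item test from item validity r_xp and item inter-correlation r_xx."""
    return math.sqrt(n) * r_xp / math.sqrt(1 + (n - 1) * r_xx)


# Illustrative values only: validity rises with test length and levels off.
for n in (1, 10, 50, 200):
    print(n, round(n_item_validity(n, r_xp=0.30, r_xx=0.15), 4))
```

For n = 1 the function returns the item validity itself, and as n grows the value approaches r_xp divided by the square root of r_xx, so lengthening a test gives diminishing returns in validity.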


E.  TEST  RELIABILITY  WHEN  VALIDITY  IS  ZERO 

Let p be independent from item to item and proceed as in Appendix C to
obtain

$E(XY) = E\{\,q[(1-\theta)p_x + \theta][(1-\theta)p_y + \theta] + (1-q)\,p_x p_y\,\} = q\,\mu_\theta^2 + (1-q)\,\mu^2$

and

$E(X) = q\,\mu_\theta + (1-q)\,\mu,$

where $\mu_\theta = (1-\theta)\mu + \theta$, which, when substituted into (C.3), yields (12).
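
The effect of these two expressions can be illustrated numerically. Since (C.3) and (12) are not reproduced in this appendix, the sketch below assumes that the item inter-correlation is the ordinary correlation between two binary scores with equal marginals, that is, covariance divided by the common variance; the function names are hypothetical.

```python
def zero_validity_moments(q, theta, mu):
    """Joint and marginal success probabilities when ability is independent across items."""
    m_guess = (1 - theta) * mu + theta            # success probability for a guesser
    joint = q * m_guess ** 2 + (1 - q) * mu ** 2  # both items answered correctly
    marginal = q * m_guess + (1 - q) * mu         # a single item answered correctly
    return joint, marginal


def item_intercorrelation(q, theta, mu):
    """Correlation between two binary item scores (assumed form of (C.3))."""
    joint, marginal = zero_validity_moments(q, theta, mu)
    return (joint - marginal ** 2) / (marginal * (1 - marginal))
```

Under this assumption the covariance reduces to q(1-q)θ²(1-μ)², so the correlation vanishes when nobody guesses (q = 0) or everybody guesses (q = 1) and is positive for a mixed population. This matches the paper's point that differences in guessing strategy alone can produce reliability.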

The  procedures  of  Appendix  D  may  be  followed  to  obtain  the  validity  of  such 
a  test.  Notice  that 

$E(Xp) = \mu\,\{q[(1-\theta)\mu + \theta] + (1-q)\mu\}$

$= E(p)\,E(X)$

so that

$\mathrm{Cov}(Xp) = E(Xp) - E(X)E(p) = E(X)E(p) - E(X)E(p) = 0.$

Logic and mathematics are employed to yield very conservative estimates of the
gains resulting from changing over from choice methods to admissible probability
measurement in the administration of existing tests.

Equations  and  graphs  give  test  reliability  and  measurement  validity  as  a  function 
of  the  distribution  of  ability  levels  in  the  population  to  be  tested  and  as  a  function 
of  the  amount  and  type  of  guessing  engaged  in  by  this  population.  Since  guessing 
degrades  the  performance  of  choice  tests  and  since  the  use  of  admissible  probability 
measurement eliminates guessing, the extent of degradation corresponds to a
conservative estimate of the gains resulting from conversion to admissible probability
measurement. 

In some applications it may be wise to trade off the increase in measurement
validity against the advantages of shortening the length of the test. Equations and
graphs  show  how  much  shorter  the  new  guessing-free  test  can  be  and  still  retain  the 
original  measurement  validity. 

Additional  equations  and  curves  show  that  choice  tests  with  zero  measurement 
validities  can  have  appreciable  reliabilities  due  to  differences  in  guessing 
strategy  in  the  population. 

All  the  analyses  indicate  that  conversion  to  admissible  probability  measurement 
will yield quite significant improvements in measurement validity along with
considerable reductions in test length.